
Text Splitter Node

The Text Splitter Node divides large text documents into smaller, manageable chunks, measuring length in either tokens or characters. It supports recursive (semantic) and fixed (simple) splitting strategies and can automatically calculate chunk sizes from a model's context window. Cross-chunking mode combines small objects, such as subtitle segments, into larger chunks while preserving metadata ranges.

How It Works

When the node executes, it reads text from the specified input variable and applies the configured splitting strategy. Recursive splitting tries multiple separators hierarchically (paragraphs, sentences, words) to keep semantic units intact, while fixed splitting uses a single separator for straightforward division.

The node can automatically calculate optimal chunk sizes from the model's context window and reserve ratio, or use an explicitly configured size. Chunks are created with optional overlap to preserve context at boundaries. Token-based measurement uses model-specific tokenizers for accurate counting, while character-based measurement uses simple string length for faster processing.

Two modes are supported: standard text chunking for plain text, and cross-chunking for combining multiple small objects (like subtitle segments or PDF pages) into larger chunks while preserving metadata ranges.
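For illustration, here is a minimal Python sketch of the fixed (simple) strategy with overlap. This is not the node's actual implementation; the function and variable names are hypothetical.

```python
def fixed_split(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Illustrative fixed-size splitter: each chunk starts where the
    previous one ended, minus the overlap (not the node's internals)."""
    step = chunk_size - overlap  # assumes overlap < chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = fixed_split("word " * 500, chunk_size=400, overlap=50)
# Each chunk is up to 400 characters and repeats the last 50 of its predecessor.
```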

Configuration Parameters

Input field

Input Field (Text, Required): Workflow variable containing text content.

The node expects text as a string, or an array of objects when cross-chunking is enabled. For cross-chunking, provide objects containing text and metadata fields.

Output field

Output Field (Text, Required): Workflow variable where chunks are stored.

The output is an array of text chunks (strings) for standard chunking, or an array of combined objects for cross-chunking. Each chunk respects configured size limits and overlap settings.

Common naming patterns: chunks, text_chunks, combined_segments.
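To illustrate the two output shapes (field names here are hypothetical and depend on your configuration):

```python
# Standard chunking: an array of plain text chunks
chunks = ["First chunk of text...", "Second chunk of text..."]

# Cross-chunking: an array of combined objects with a preserved metadata range
combined_segments = [
    {"text": "Combined subtitle text...", "start": 0.0, "end": 12.4},
    {"text": "Next combined span...", "start": 12.4, "end": 31.9},
]
```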

Splitting mode

Splitting Mode (Dropdown, Default: Character-based): How to measure chunk length.

  • Character-based: uses simple string length. Best for embeddings and simple text processing. Fastest.
  • Token-based: uses tiktoken or HuggingFace tokenizers. Best for LLM processing, RAG, and accurate token limits. Slower.

Token-based ensures chunks fit within model token limits; character-based provides faster processing when exact counts aren't critical.
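The difference between the two measurements can be sketched in Python (assuming the tiktoken package is installed; this only illustrates counting, not the node's internals):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Text splitters measure length in characters or in tokens."

char_len = len(text)               # character-based: simple string length
token_len = len(enc.encode(text))  # token-based: model-specific token count
```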

Splitting strategy

Splitting Strategy (Dropdown, Default: Recursive): How to divide text into chunks.

  • Recursive (Semantic): tries multiple separators hierarchically. Best for most use cases; preserves paragraphs, sentences, and meaning.
  • Fixed (Simple): straightforward fixed-size splitting. Best when semantic boundaries don't matter.

Recursive strategy is recommended for most applications as it maintains document structure and semantic coherence.

Chunk size mode

Chunk Size Mode (Dropdown, Default: Manual): How to determine chunk size.

  • Manual (Explicit): uses the explicit Chunk Size value. Use when you need precise control over chunk size.
  • Auto (Calculated): calculates the size as Model Context Window × (1 - Token Reserve Ratio). Use when chunks should be optimized for model capacity.

Chunk size

Chunk Size (Number, Conditional): Maximum size per chunk.

Required for Manual size mode. Units depend on Splitting Mode: tokens if Token-based, characters if Character-based. Range: 100-200,000. Variable interpolation with ${chunk_size} is supported.

Model context window

Model Context Window (Number, Conditional): Model's maximum context window in tokens.

Required for Auto size mode. Common values: GPT-4: 8,192; GPT-4-turbo: 128,000; Claude-3: 200,000; Llama-3.2: 128,000. Range: 1-200,000 tokens.

Token reserve ratio

Token Reserve Ratio (Number, Default: 0.3): Ratio of tokens to reserve for prompts/responses (0.0-0.9).

Formula: Chunk Size = Model Context Window × (1 - Token Reserve Ratio). Examples: 0.3 (good for RAG/Q&A), 0.2 (good for summarization), 0.4 (good for complex prompts).
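A quick worked example of the Auto calculation, using the GPT-4-turbo context window listed above:

```python
# Auto mode: Chunk Size = Model Context Window × (1 - Token Reserve Ratio)
context_window = 128_000   # e.g., GPT-4-turbo
reserve_ratio = 0.3        # reserve 30% for prompts and responses

chunk_size = int(context_window * (1 - reserve_ratio))
# chunk_size == 89_600 tokens available per chunk
```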

Overlap size

Overlap Size (Number, Default: 0): Overlapping units between consecutive chunks.

Helps preserve context at chunk boundaries. Typical values: 10-20% of chunk size. Range: 0-10,000. Variable interpolation with ${overlap} is supported.

Split separators

Split Separators (Array, Default: ["\n\n", "\n", "."]): Separators to split on, in priority order.

Used by the Recursive strategy. Maximum 10 separators. Use \n for newlines and \t for tabs. The recursive strategy tries each separator in order, using the first one that produces chunks within the target size.
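The priority-order behavior can be sketched as follows. This is a simplified illustration, not the node's implementation: it drops separators rather than keeping them, and omits the merging of small adjacent pieces that a production splitter would do.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified sketch of recursive splitting (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text]
    sep = next((s for s in separators if s in text), None)
    if sep is None:
        # No separator applies: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if part:
            pieces.extend(recursive_split(part, separators, chunk_size))
    return pieces

document_text = "Para one.\n\nPara two. More text." * 100
chunks = recursive_split(document_text, ["\n\n", "\n", "."], chunk_size=1000)
```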

Keep separator

Keep Separator (Toggle, Default: true): Retain separator at end of each chunk.

Keeping separators preserves document structure (paragraphs, sentences). When disabled, separators are removed, producing cleaner text but losing structural markers.

Use regex separators

Use Regex Separators (Toggle, Default: false): Treat separators as regular expressions.

Allows advanced pattern-based splitting. Keep disabled for normal use with plain text separators.
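As an example of what a regex separator enables (an illustrative pattern; verify it against your own content):

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
pattern = r"(?<=[.!?])\s+"
re.split(pattern, "First sentence. Second one! A third?")
# -> ['First sentence.', 'Second one!', 'A third?']
```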

Minimum chunk size

Minimum Chunk Size (Number, Optional): Minimum acceptable chunk size.

Prevents creation of tiny fragments. Chunks smaller than this may be merged with adjacent chunks or skipped. Range: 10-10,000.

Maximum chunks

Maximum Chunks (Number, Optional): Upper limit on number of chunks returned.

If content would produce more chunks, the result is truncated. Useful for limiting processing on very long documents. Range: 1-10,000.

Tokenization strategy

Tokenization Strategy (Dropdown, Default: Model-based): How tokens are interpreted for Token-based mode.

  • Model-based - Use model's default tokenizer. Requires Model Name parameter.
  • Encoding-based - Specify custom tiktoken encoding. Requires Tiktoken Encoding parameter.

Model name

Model Name (Text, Conditional): Model name for tokenization.

Required when Tokenization Strategy is Model-based. Examples: gpt-4, gpt-3.5-turbo, claude-3-opus, llama3.2, mistral-7b. Variable interpolation with ${model_name} is supported.

Tiktoken encoding

Tiktoken Encoding (Text, Conditional): Tiktoken encoding to use.

Required when Tokenization Strategy is Encoding-based. Common encodings: cl100k_base (GPT-4), p50k_base (GPT-3), r50k_base (GPT-2). Variable interpolation with ${encoding} is supported.
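The two strategies map onto tiktoken roughly like this (a sketch assuming tiktoken is available; HuggingFace models would use their own tokenizers instead):

```python
import tiktoken

# Model-based: tiktoken resolves the tokenizer registered for a model name.
enc = tiktoken.encoding_for_model("gpt-4")

# Encoding-based: name the encoding explicitly.
enc = tiktoken.get_encoding("cl100k_base")

token_count = len(enc.encode("How many tokens is this sentence?"))
```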

Enable cross-chunking

Enable Cross-Chunking (Toggle, Default: false): Combine multiple small objects into larger chunks.

When enabled, requires Object Mapping configuration to specify how to extract text and metadata from objects.

Cross-chunking configuration

When cross-chunking is enabled, configure attribute mappings:

  • Text Attribute Path: Path to text content using dot notation (e.g., text, pageContent, data.text)
  • Start Metadata Path: Path to start time/number/value (e.g., start_time, metadata.start, page)
  • End Metadata Path: Path to end time/number/value (e.g., end_time, metadata.end, page)
  • Metadata Type: Type for future smart merging (Time for timestamps, Number for page numbers)
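The mapping behavior can be sketched as follows. This is a simplified illustration with hypothetical path defaults; the node's internals may differ.

```python
def get_path(obj: dict, path: str):
    """Resolve a dot-notation path like 'metadata.start' (illustrative)."""
    for key in path.split("."):
        obj = obj[key]
    return obj

def cross_chunk(objects, chunk_size, text_path="text",
                start_path="start_time", end_path="end_time"):
    """Sketch: combine small objects into chunks while tracking the
    metadata range each chunk covers."""
    chunks, buffer, start = [], [], None
    for obj in objects:
        if start is None:
            start = get_path(obj, start_path)
        buffer.append(get_path(obj, text_path))
        if sum(len(t) for t in buffer) >= chunk_size:
            chunks.append({"text": " ".join(buffer),
                           "start": start, "end": get_path(obj, end_path)})
            buffer, start = [], None
    if buffer:  # flush any remainder, ending at the last object's metadata
        chunks.append({"text": " ".join(buffer),
                       "start": start, "end": get_path(objects[-1], end_path)})
    return chunks

segments = [
    {"text": "Hello there.", "start_time": 0.0, "end_time": 2.1},
    {"text": "General Kenobi!", "start_time": 2.1, "end_time": 4.0},
]
combined = cross_chunk(segments, chunk_size=20)
# -> [{"text": "Hello there. General Kenobi!", "start": 0.0, "end": 4.0}]
```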

Common parameters

This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, Logging Mode, and Wait For All Edges. For detailed information, see Common Parameters.

Best practices

  • Use token-based mode when processing text for LLMs to ensure chunks fit within model limits and avoid truncation
  • Set overlap to 10-20% of chunk size to preserve context at boundaries, especially for RAG applications
  • Adjust Token Reserve Ratio based on prompt complexity: 0.2-0.3 for simple prompts, 0.3-0.4 for complex prompts with examples
  • Customize separators based on content type: paragraph and sentence separators for prose, code-specific separators for source code
  • Test different chunk sizes: smaller chunks (200-500 tokens) for precise retrieval, larger chunks (1000-2000 tokens) for summarization
  • For cross-chunking subtitle segments or timed data, ensure objects are sorted chronologically before processing

Limitations

  • Token counting accuracy: Token-based mode requires model-specific tokenizers. If unavailable, the node falls back to character-based mode.
  • Memory usage: Processing very large documents with small chunk sizes can create thousands of chunks. Use Maximum Chunks to limit output.
  • Cross-chunking requirements: Requires properly structured input objects with consistent attribute paths. Missing or malformed attributes cause objects to be skipped.
  • Separator handling: Regex separators must be valid patterns; an invalid pattern causes the node to fail.
  • Overlap constraints: Overlap size cannot exceed half the chunk size. The node automatically clamps overlap to chunk_size / 2 if exceeded.